Journal of Beijing University of Posts and Telecommunications

  • EI core journal

JOURNAL OF BEIJING UNIVERSITY OF POSTS AND TELECOM ›› 2009, Vol. 32 ›› Issue (4): 122-127. doi: 10.13190/jbupt.200904.122.shenxy

• Reports

A Parallable Algorithm for Chinese CoTopic Words Clustering

Jun-Liang Chen, Xiang-Wu Meng

  • Received: 2008-11-20 Revised: 2009-06-01 Online: 2009-08-28 Published: 2009-08-28

Abstract:

A simple but powerful algorithm for automatically clustering Chinese co-topic words is presented. First, the method uses the enumeration comma ‘、’ to split sentences in a corpus and extract the paratactic (coordinate) Chinese words they contain, and it constructs a co-citation graph in which each Chinese word is a node. Second, the method generates several locality-sensitive hashing (LSH) signature combinations for each node in the co-citation graph. Nodes that share at least one LSH signature combination are grouped together, and most nodes in a group are likely to belong to the same topic. The main advantages of the algorithm are its fast computation and the ease with which it can be implemented in parallel. Experimental results indicate that the algorithm is efficient and clusters well.
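The two steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes several assumptions. The abstract does not name the LSH family, so MinHash over each word's neighbor set is assumed. A "signature combination" is assumed to be one band of consecutive MinHash values. Input sentences are assumed to be already trimmed to the enumerated words. All function names and the `bands` and `rows` parameters are hypothetical.

```python
import hashlib
from collections import defaultdict


def split_enumerations(sentences):
    """Split each sentence on '、' and keep lists with at least two words."""
    groups = []
    for sentence in sentences:
        items = [w.strip() for w in sentence.split("、") if w.strip()]
        if len(items) >= 2:
            groups.append(items)
    return groups


def build_cocitation_graph(groups):
    """Map each word to the set of words it was listed with (itself included)."""
    neighbors = defaultdict(set)
    for items in groups:
        for word in items:
            neighbors[word].update(items)
    return neighbors


def _hash(seed, word):
    """Deterministic 64-bit hash; the seed selects one hash function (assumed scheme)."""
    digest = hashlib.md5(f"{seed}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(neighbor_set, num_hashes):
    """MinHash signature of a neighbor set (assumed LSH family)."""
    return [min(_hash(seed, w) for w in neighbor_set) for seed in range(num_hashes)]


def cluster_words(sentences, bands=4, rows=2):
    """Group words that share at least one band of their MinHash signature."""
    graph = build_cocitation_graph(split_enumerations(sentences))

    # Bucket every word under each of its band keys.
    buckets = defaultdict(list)
    for word, nbrs in graph.items():
        sig = minhash_signature(nbrs, bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(word)

    # Merge words that share a bucket into connected components (union-find).
    parent = {w: w for w in graph}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for members in buckets.values():
        root = find(members[0])
        for w in members[1:]:
            parent[find(w)] = root

    clusters = defaultdict(set)
    for w in graph:
        clusters[find(w)].add(w)
    return list(clusters.values())
```

Each word's signature depends only on its own neighbor set, so under these assumptions the signature and bucketing steps could be computed independently per node, which is one plausible reading of why the method is described as easy to parallelize. With more `rows` per band, words need more similar neighbor sets to land in the same bucket; with more `bands`, they get more chances to collide.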

Key words: Chinese word clustering, co-citation graph, connected component, LSH signature, parallable